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A Model Approximation Scheme for Planning in Partially 
Observable Stochastic Domains 



Partially observable Markov decision processes (POMDPs) are a natural model for 
planning problems where effects of actions are nondeterministic and the state of the world 
is not completely observable. It is difficult to solve POMDPs exactly. This paper pro- 
poses a new approximation scheme. The basic idea is to transform a POMDP into another 
one where additional information is provided by an oracle. The oracle informs the plan- 
ning agent that the current state of the world is in a certain region. The transformed 
POMDP is consequently said to be region observable. It is easier to solve than the origi- 
nal POMDP. We propose to solve the transformed POMDP and use its optimal policy to 
construct an approximate pohcy for the original POMDP. By controUing the amount of ad- 
ditional information that the oracle provides, it is possible to find a proper tradeoff between 
computational time and approximation quality. In terms of algorithmic contributions, we 
study in details how to exploit region observability in solving the transformed POMDP. 
To facilitate the study, we also propose a new exact algorithm for general POMDPs. The 
algorithm is conceptually simple and yet is significantly more efficient than all previous 
exact algorithms. 

1. Introduction 

In a completely observable and deterministic world, to plan is to find a sequence of actions 
that will lead an agent to achieve a goal. In real-world applications, the world is rarely com- 
pletely observable and effects of actions are almost always nondeterministic. For this reason, 
a growing number of researchers concern themselves with planning in partially observable 
stochastic domains (e.g.. Dean & Wellman, 1991; Cassandra et al, 1994; Parr & Russell, 
1995; Boutilier &; Poole, 1996). Partially observable Markov decision processes (POMDPs) 
can be used as a model for planning in such domains. In this model, nondeterminism in 
effects of actions is encoded by transition probabilities, partial observability of the world 
by observation probabilities, and goals and criteria for good plans by reward functions (see 
Section 2 for details). 

POMDPs are classified into finite horizon POMDPs and infinite horizon POMDPs de- 
pending on the number of time points considered. Infinite horizon POMDPs are usually 
used for planning because one typically does not know beforehand the number of steps 
it takes to achieve a goal. This paper is concerned with how to solve an infinite horizon 
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1.1 Difficulties in Solving POMDPs 

When the world is fully observable, a POMDP reduces to a Markov decision process (MDP). 
MDPs have been studied extensively in the dynamic-programming literature (e.g., Puter- 
man, 1990; Bertsekas, 1987, White, 1993). Recent works have concentrated on how to deal 
with large state spaces (Dean et ai, 1993; Boutilier et ai, 1995; Dean &; Lin, 1995). 

We are concerned with the partially observable case. This case is considerably more 
difficult than the fully observable case for two related reasons. First, when the agent knows 
exactly in which state the world currently is, information from the past (past observations 
and actions) is irrelevant to the current decision. This is the Markov property. On the 
other hand, when the agent does not fully observe the state of the world, past information 
becomes relevant because it can help the agent to better estimate the current state of the 
world. The problem is that the number of possible states of past information increases 
exponentially with time. 

Second, in MDPs the effects of an action are fully observed at the next time point. In 
POMDPs, on the other hand, the effects of an action are not fully observed at the next 
time point. Hence one cannot clearly tell the effects of the current action from those of the 
agent's future behaviors. To properly evaluate the effects of an action, one needs to look 
into the future and consider the combination of the action with each of the agent's possible 
behaviors in a, possibly large, number of future steps. The problem is that the number of 
ways the agent can behave is exponential in the number of future steps considered. 

1.2 Previous Work 

Previous methods for solving POMDPs are usually classified into exact methods and approx- 
imate methods (Lovejoy, 1991a). They can also can be classified according to which of the 
aforementioned two difficulties they directly address. Most previous methods address the 
difficulty of exponential number of future behaviors (Sondik, 1971; Sondik & Mendelssohn, 
1979; Monahan, 1982; Cheng, 1988; Lovejoy, 1991b; and Cassandra et ai, 1994). They 
prune from consideration behaviors that can never be optimal no matter what the infor- 
mation state is (Section 4). Other methods deal with the problem of exponential number 
of past information states either by aggregating them (Platzman, 1977; White & Schere, 
1994) or by considering only a subset of them (Lovejoy, 1992; Brafman, 1997; Hauskrecht, 
1997). They are approximation methods in nature. 

1.3 Model Approximation 

In previous approximation methods, approximation takes place in the process of solving 
a POMDP. We advocate model approximation methods. Such a method approximates a 
POMDP itself by another one that is easier to solve and uses the solution of the latter to 
construct an approximate solution to the original POMDP. 

Model approximation can be in the form of a more informative observation model, or 
a more deterministic action model, or a simpler state space, or a combination of two or all 
of the three alternatives. Cassandra et al. (1996) proposed to approximate POMDPs by 
using MDPs. This is an example of model approximation in the form of a more informative 
observation model. There is also some work on reducing the size of the state spaces of MDPs 



200 



Model Approximation for Planning under Uncertainty 



by aggregation (e.g., Bertsekas &; Castanon, 1989; Dean &: Lin, 1995; Dean &; Givan, 1997). 
Such work can be conceivably extended to POMDPs, leading to model approximation in 
the form of a simpler state space. We are not aware of any model approximation schemes 
in the form of a more deterministic action model. 

1.4 Our Proposal 

This paper proposes a new model approximation scheme in the form of a more informative 
observation model. It is a generalization of the idea of approximating POMDPs by using 
MDPs. 

We transform a POMDP by assuming that, in addition to the observations obtained by 
itself, the agent also receives a report from an oracle who knows the true state of the world. 
The oracle does not report the true state itself. Instead, it selects, from a predetermined 
list of candidate regions, a region that contains the true state and reports that region. The 
transformed POMDP is said to be region observable because the agent knows for sure that 
the true state is in the region reported by the oracle. 

When all candidate regions are singletons, the oracle actually reports the true state of 
the world. The region observable POMDP reduces to an MDP. MDPs are much easier to 
solve than POMDPs. One would expect the region observable POMDP to be solvable when 
all candidate regions are small. 

In terms of approximation quality, the larger the candidate regions, the less additional 
information the oracle provides and hence the more accurate the approximation. In the 
extreme case when there is only one candidate region and it consists of all possible states 
of the world, the oracle provides no additional information at all. Consequently, the region 
observable POMDP is identical to the original POMDP. 

A method for determining approximation quality will be described later in this paper. 
It allows one to make the tradeoff between approximation quality and computational time 
as follows: start with small candidate regions and increase their sizes gradually until the 
approximation becomes accurate enough or the region observable POMDP becomes un- 
tractable. 

Due to problem characteristics, accurate approximation can usually be achieved with 
small candidate regions. In many applications the agent often has a good idea about the 
state of the world (e.g., Simmons Sz Koenig, 1995). Take robot path planning as an example. 
Observing a landmark, a room number for instance, would imply that the robot is at the 
proximity of that landmark. Observing a feature about the world, a corridor T-junction for 
instance, might imply the robot is in one of several regions. Taking history into account, the 
robot might be able to determine a unique region for its current location. Also, an action 
usually moves the state of the world to only a few "nearby" states. Thus if the robot has a 
good idea about the current state of world, it should continue to have a good idea about it 
in the next few steps. 

When the agent has a good idea about the state of the world at all times, the oracle 
does not provide much additional information even with small candidate regions and hence 
approximation is accurate. Region observable POMDPs with small candidate regions are 
much easier to solve than general POMDPs. 
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1.5 Organization 

We will first show how POMDPs can be used as a model for planning in partially observable 
stochastic domains (Section 2) and give a concise review of the theory of POMDPs (Sections 
3 and 4). We will then propose a new method for dynamic-programming updates, a key 
step in algorithms that solve POMDPs via value iteration (Section 5). Thereafter, we will 
formally introduce the concept of region observable POMDPs (Section 6) and develop an 
algorithm for solving region observable POMDPs (Sections 7, 8, and 9). In Section 10, 
we will discuss decision making for the original POMDPs based on the solutions of their 
region observable approximations, followed by a method for determining approximation 
quality (Section 11) and a method to make the tradeoff between approximation quality and 
computational time (Section 12). Finally, empirical results will be reported in Section 13 
and conclusions will be provided in Section 14. 

2. Planning in Stochastic Domains and POMDPs 

To specify a planning problem, one needs to give a set S of possible states of the world, 
a set O of possible observations, and a set A of possible actions. The sets O and A are 
always assumed to be finite in the literature, while the state space S can be continuous as 
well as finite. In this paper, we consider only finite state space. One needs also to give an 
observation model which describes the relationship between observations and the state of 
the world, and an action model which describes effects of actions. 

As a background example, consider path planning for a robot who acts in an office 
environment. Here S is the set of all location-orientation pairs, O is the set of possible 
sensor readings, and A consists of actions move-forward, turn-left, turn-right, and 
declare-goal. 

The current observation o depends on the current state of the world s. Due to sensor 
noise, this dependency is uncertain in nature. The observation o sometimes also depends 
on the action that the robot has just taken a.. The minus sign in the subscript indicates 
the previous time point. In the POMDP model, the dependency of o upon s and a- is 
numerically characterized by a conditional probability P{o\s,a-), which is usually referred 
to as the observation probability. It is the observation model. 

In a region observable POMDP, the current observation also depends on the previ- 
ous state of the world s.. The observation probability for this case can be written as 
P{o\s, a-, S-). 

The state s+ the world will be in at the next time point depends on the current action a 
and the current state s. The plus sign in the subscript indicates the next time point. This 
dependency is again uncertain in nature due to uncertainty in the actuator. In the POMDP 
model, the dependency of s+ upon s and a is numerically characterized by a conditional 
probability P{s^\s,a), which is usually referred to as the transition probability. It is the 
action model. 

On many occasions, we need to consider the joint conditional probability -P(s+, o+js, a) 
of the next state of the world and the next observation given the current state and the 
current action. It is given by 

P(s+, o_,_|s, a) = P{s^\s,a)P{o^\s^,a,s). 
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Knowledge about the initial state, if available, is represented as a probability distribution 
Pq over S. When the agent knows the initial state with certainty, Pq is 1 at the initial state 
and everywhere else. The planning goal is encoded by a reward function such as the 
following: 



The preference for short plans is encoded by discounting future rewards with respect to the 
current reward (see the next section). 

In summary, a POMDP consists of a set of possible states of the world, a set of possible 
observations, a set of possible actions, a observation probability, a transition probability, 
and a reward function. An MDP has the same ingredients as an POMDP except that it has 
no observation probability. This is because the state of the world is completely observed in 
an MDP. 

3. Basics of POMDPs 

This section reviews several concepts and results related to POMDPs. 
3.1 Belief States 

In a POMDP, an agent chooses and executes an action at each time point. The choice 
is made based on information from the past (past observations and past actions) and the 
current observation. The amount of memory required to store past observations and actions 
increases linearly with time. This makes it difficult to maintain past information after a 
long period of time. 

The standard way to overcome this difficulty is to maintain, instead of past information, 
the agent's belief state — the probability distribution P(st\ot, cit-i, ot_i, . . . , ai, oi, Pq) of the 
current state St of the world given past information and the current observation. It is well 
known that the belief state is a sufficient statistic in the sense that it captures all the 
information contained in past information and the current observation that is useful for 
action selection. Hence the agent can base its decision solely on the belief state. 

Compared with maintaining past information, maintaining the belief state is desirable 
because the number of possible states of the world is finite. One needs only to maintain a 
fixed and finite number of probability values^. 

The initial belief state is Pq. One question is how the agent should update its belief 
state as time goes by. Following Littman (1994), we use b to denote a belief state. For any 
state s, b{s) is the probability that the world is in state s. The set of all possible belief 
states will be denoted by B. 

Suppose b is the current belief state, and a is the current action. If the observation o+ 
is obtained at the next time point, then the agent should update its belief state from b to 
a new belief state b+ given by 



1. Storing these values exactly could require unbounded precision. Approximations are implicitly being 
made due to the fact that machine precision is bounded. 




1 if a=declare-goal and s=goal 
otherwise. 



b+{s+) = kY, o+|s, a)b{s), 



(1) 



s 
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where k=l/ Y^^ ..^ P{s+,o+\s,a)b{s) is the normalization constant (e.g., Littman, 1994). 

3.2 POMDPs as MDPs 

For any belief state b and any action a, define 

r{b,a) = Y,b{s)r{s,a). (2) 

s 

It is the expected immediate reward for taking action a in belief state b. 
For any belief state b, any action a, and any observation o+, define 

P(o+|6, a) = Y^ P{s+, o+|s, a)b{s). (3) 

It is the probability of observing o+ at the next time point given that the current belief state 
is b and the current action is a. Let b+ be the belief state given by equation (1). P{o+\b, a) 
can also be understood as the probability of the next belief state being b+ given that the 
current action is a and the current belief state is b. 

A POMDP over world state space S can be viewed as an MDP over the belief state 
space B. The the reward function and the transition probability of the MDP are given by 
Equations (2) and (3) respectively. 

3.3 Optimal Policies 

At each time point, the agent consults its belief state and chooses an action. A policy tt 
prescribes an action for each possible belief state. Formally it is a mapping from B to A. 
For each belief state 6, 7r(6) is the action prescribed by tt for b. 

Suppose bo is the current belief state. If an agent follows a policy tt, then its current 
action is 7r(6o) and the immediate reward is ro{b,n{bo)); with probability P(o_,_|6o, 7r(6o))! 
the agent's next belief state bi will be as given by Equation (1), the next action will be 
7r(6i), and the next reward will be ri(6i, 7r(6i)); and so on and so forth. The quality of 
a policy is measured by the expected discounted rewards it garners. Formally the value 
function of a policy tt is defined for each belief state bo to be the following expectation: 

oo 

V^bo) = E,^[Y,Yn{b„n{k))], (4) 

2 = 

where 0<7<1 is the discount factor. 

A policy TTi dominates another policy 7r2 if for each belief state b^B 

> (5) 

Domination is a partial ordering among policies. It is well known that there exists a policy 
that dominates all other policies (e.g., Puterman, 1990). Such a policy is called an optimal 
policy. The value function of an optimal policy is called the optimal value function and is 
denoted by V*. 
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3.4 Value Iteration 

Value iteration is a standard way for solving infinite horizon MDPs (Bellman, 1957). It 
begins with an arbitrary initial function Vq*(6) and improves it iteratively by using the 
following equation 

V;{b) = maxMb, a) + 7 E ^Kl^' ^Wt-iiK)], (6) 

0+ 

where 6+ is the belief state given by equation (1). If Vo*=0, then V^* is called the t-step 
optimal value function. For any belief state b, V^*{b) is the optimal expected reward the 
agent can get in t steps starting from b. 

The following theorem (Puterman, 1990, page 361) tells one when to terminate value 
iteration and how to construct a "good enough" policy. 

Theorem 1 Let n the policy given by 

Tr{b) =argmaXa[r{b,a)+jJ2Pio+\^^"-)^tiK)]- (7) 

0+ 

Ifmaxb^B\Vt*{b) - V*_i{b)\ < e, then 

maxb^B\V^b) - V*ib)\ < -^.D (8) 

1-7 

The quantity maa;6g^|F/(6) — Vjl^(6)| is sometimes called the Bellman residual and the 
policy TT is called the greedy policy based on V^*. 

Algorithms for POMDPs are classified into exact or approximation algorithms depending 
on whether they compute the i-step optimal value function V^* exactly (Lovejoy, 1991a). 
In the next two sections, we discuss the theoretical foundations of exact algorithms and 
develop a new exact algorithm. Thereafter, we propose a new approximation algorithm. 

4. Piecewise Linearity and Implicit Value Iteration 

Since the belief space is continuous, exact value iteration cannot be carried out explicitly. 
Fortunately, it can be carried out implicitly due to the piecewise linearity of the t-step 
optimal value functions. To explain piecewise linearity, we need the concept of policy trees. 

4.1 Policy Trees 

A t-step policy tree pt (Littman, 1994) prescribes an action for the current time point 
and an action for each possible information scenario (oi, . . . , Oj, • • ■ : o-i-i) at each of the 
next t—1 time points i. Figure 1 shows a 3-step policy tree. The tree reads as follows. 
Move-forward at the current time point. At the next time point, if oi=0 is observed then 
turn-left. Thereafter if O2=0 is observed then turn-left again; else if 02=1 is observed 
then declare-goal; else if 02=2 is observed then move-forward. And so on and so forth. 
To relate back to the introduction, a t-step policy tree prescribes a way the agent might 
behave at the current and the next t—1 time points. 
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Figure 1: A 3-step policy tree. 

When i>l, the subtree rooted at the oi node will be called a o-rooted t—l-step policy 
tree, and will be denoted by 5^^^. It is a mapping from the set of possible observations O 
to the set of all possible t—l-step policy trees; it prescribes a t—1 step policy tree ^*~^(o) 
for each possible observation o. In our example, 5^{oi=0) is the 2-step policy tree rooted 
at the uppermost ai node. 

When t>l, a t-step policy tree pi has two components: an action a for the current time 
point and an o-rooted t—l-step policy tree S^^^ for the next t—1 time points. For this 
reason, we shall sometimes write pt as a pair (a, 5^~^ ) and call a the first action oi pt- 

By altering the actions on the edges out of the a-nodes, one obtains different t-step 
policy trees. The set of all possible t-step policy trees will be denoted by Vf A 1-step 
policy tree is simply an action, and hence Vi is the same as the set of possible actions A. 

4.2 State Value Functions of Policy Trees 

For any state s and any t-step policy tree pt={a, 5^^^), recursively define 

Vp,{s) = r{s,a) +jY.Y.^St-Ho+)is+)P{s+,o+\s,a), (9) 

0+ s + 

where the second term is to be understood as when t=l. It is the expected discounted 
total reward the agent receives at the current time and during the next t—1 time points if 
the world is currently in state s and the agent behaves according to the policy tree pt. We 
call the state value function of the t-step policy tree pt. 

Without mentioning the policy tree, we shall sometimes call Vp^ a t-step state value 
function. The collection of all t-step state value functions will be denoted by Vt, i.e. 

Vt = {Vp,\pteVt}. 

For convenience, we let Vo consist of one single function of s that is zero for all s. 
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4.3 State Space Functions and Belief Space Functions 

It is worthwhile to point out that a i-step state value function is a state space function, i.e. 
a function over the state space S, while the i-step optimal value function is a belief space 
function, i.e. a function over the belief space B. We often use notations such as a or /3 to 
refer to state space functions. A state space function a{s) induces a belief space function 
through 

a{b) = ^ a{s)b{s). 

s 

Regarding 6 as a vector with one component b{s) for each s, the induced belief space function 
is a linear combination of the components of b. For convenience, we simply say that a{b) is 
linear in b. 

A collection V of state space functions induces a belief space function through 

V{b) = maXa£V(^{t>)- (10) 

Note that we are using V to denote both a set of state space functions and the belief space 
function it induces. When V is empty, V(6) is by definition. 

The induced belief space function V{b) is piecewise linear in b in the sense that, for each 
aGV, it equals a{b) in the region {b\a{b)>l3{b) for any other /3gV} of the belief space B and 
hence is linear in b in that region. 

4.4 Piecewise Linearity of Optimal Value Functions 

The following theorem was first proved by Sondik (1971). It first appeared in its present 
form in Littman (1994). 

Theorem 2 (Piecewise Linearity^ The t-step optimal value function V* is the same as 
the belief space function induced by the collection of all t-step state value functions Vt, i-e. 
for any belief state b 

v:{b)=Vt{b).n 

The theorem is true for the following reasons. V^*{b) is the reward the agent receives if it be- 
haves optimally and for any policy tree pt, Vp^ (6) is the reward the agent gets if it behaves ac- 
cording to pt- Because one of the policy trees must be optimal, V*{b) = maXp^Vp^{b)=Vt{b) . 
Due to this theorem, we say that the collection Vt of state value functions is representation 
ofF,*. 

4.5 Parsimonious Representations 

The size of increases exponentially with t. As a matter of fact, the total number of i-step 
policy trees (Cassandra, 1994) is: 

\o\*-i 

\Vt\ = \A\ loi-i . 

There are potentially the same number of t-step state value functions. Fortunately, many of 
the state value functions can be pruned without afi'ecting the induced belief space function. 
Let us make this property more explicit. 
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Let W and X be two sets of state space functions. We say that W covers X if it induces 
the same behef space function as X does, i.e. if 

= X{h) 

for any beUef state h. We say that W parsimoniously covers X ii W covers X and none 
of its proper subsets do. When W covers or parsimoniously covers X, we refer to W as a 
covering or a parsimonious covering of X. 

Theorem 3 All parsimonious coverings of a set of state space functions consist of the same 
number of state space functions. □ 

The theorem has been known for sometime (e.g., Littman, 1994). Due to this theorem, one 
can also define a parsimonious covering as a covering that contains the minimum number 
of state space functions. 

A parsimonious covering Vt of Vt is also a representation of V* in the sense that Vt{b) = 
V*{b) for any belief state b. This representation is parsimonious because it consists of the 
fewest number of state space functions among all the representations of V^*. 

4.6 Dynamic-Programming Updates 

The question now is how to obtain a parsimonious covering of Vt- As will be shown in 
the next section, it is possible to obtain a parsimonious covering of Vt by starting from a 
parsimonious covering of Vt-i- The process of computing a parsimonious covering of Vt 
from a parsimonious covering of Vt-i is called dynamic-programming updates (Littman et 
at, 1995). It is a key step in algorithms that solve POMDPs via value iteration. 

Previous algorithms for dynamic-programming updates include the enumeration and 
pruning algorithms by Monahan (1992), Eagle (1984), and Lark (White, 1991), the one- 
pass algorithm by Sondik (1971), the linear support and relaxed region algorithms by Cheng 
(1988), and the witness algorithm by Cassandra et al. (1994) and Littman (1994). The 
witness algorithm has been empirically proved to be the most efficient among all those 
algorithms (Littman et al., 1995). 

4.7 Implicit Value Iteration 

The procedure solvePOMDP shown in Figure 2 carries out value iteration implicitly: instead 
inductively computing the i-step optimal value function V* itself, it computes a parsimo- 
nious covering of — a set of state space functions that represents V* . In the procedure, 
the subroutine update(Vt_i) takes a parsimonious covering V(-i of V(-i and returns a parsi- 
monious covering Vt of Vt- It can be implemented using any of the algorithms mentioned in 
the previous subsection. The subroutine stop(Vt, Vt_i, e) determines whether the Bellman 
residual has fallen below the threshold e from the parsimonious coverings Vt-i and Vt of 
Vt-i and Vt- See Littman (1994) for an implementation of this subroutine. 

Procedure solvePOMDP terminates when the Bellam residual falls below the threshold 
e and return a set of state space functions. The set Vt of state space functions returned 
represents the i-step optimal value function F/. It is the solution to the input POMDP. 
The planning agent keeps Vt in its memory. When it needs to make a decision, the agent 
consults its belief state b and chooses an action using (7) with V/(6+) replaced by Vt{i>+)- 
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Procedure solvePOMDP(A^, e): 

• Input: M — k POMDP, 

e — A positive number. 

• Output: A set of state space functions. 

1. i ^ 0, Vo ^ {0}. 

2. Do 

• t=t+l. 

• Vt^ update (Vt_i). 
while stop(Vt, Vt_i, e) = no. 

3. Return Vf 



Figure 2: Implicit value iteration. 

5. A New Algorithm for Dynamic-Programming Updates 

This section proposes a new algorithm for dynamic-programming updates. There are four 
subsections. In the first three subsections, we show that a parsimonious covering of Vt can 
be obtained by starting from a parsimonious covering of Vt_i and, while doing so, introduce 
concepts and results that are necessary for the development of the new algorithm. 

5.1 Relationship Between Vt-i and Vt 

Suppose W and X are two sets of state space functions. The cross sum W^X of W and 
X is the following set of state space functions: 

W0;f = {a+(5\a^W,(3(^X}. 

It is evident that the cross sum operation is commutative and associative. Consequently, we 
can talk about the cross sum Wj of a list of sets Wo, . . . , Wat of state space functions. 
For any action a and any observation o+, define 

= {7l]a(s+)^(s+,o+|s,a)|aeVt-i}. (11) 

«+ 

Note that since a and o+ are given, each member 7 Xl.s+ '^(•'^+)-^(''*+? 0+15,0) of the above 
set is a function of s, in other words, a state space function. Hence Qa,o+ is a set of state 
space functions. Let 0, 1, . . . , be an enumeration of all possible values of o+. We use 
®o+Qa,o+ to denote the cross sum 0,^o Qa,i. 

Proposition 1 Vt = VJa[{r{s,a)}®{® Qa,o+)]. 
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Proof: By the definition of the set Vt and Equation (9), a state space function a is in Vt 
if and only if there exist action a and f3o+ G Vt-i for each o+ G O such that 

a{s) = r{s,a) + ^7^/3o+(s+)-P(s+,o+|s,a). 

o+ s+ 

The proposition follows. □ 

5.2 Properties of Coverings 

Lemma 1 Suppose W, X , and y are three sets of state space functions. If W parsimo- 
niously covers X and X covers y, then W parsimoniously covers ofy. □ 

Lemma 2 Let W , W , X, and X' he four sets of state space functions. If W covers W 
and X' covers X, then 

1. W®X' covers WQiX. 

2. W'UX' covers WUX.O 

5.3 Coverings of Vt from Parsimonious Coverings of Vt i 

Let Vt-i be a parsimonious covering of Vt-i- For any action a and any observation o+, 
define 

Q'a,o+ = {7l]a(5+)^(s+,o+|s,a)|aGVt_i}. 

Note that the definition of Q'a^o+ ''^e same as that of Qa,o+ except that Vt-i is replaced 
by Vt-i- Also define 

V^ = U„[{r(.,a)}0(©^^Q;„J]. 

Proposition 2 The set V[ covers Vt. 

Formal proof of this proposition is given in Appendix A. Informally, the fact that Vt-\ 
covers Vt-i implies that Q!a,o+ covers Qa,o+: which in turn implies that V[ covers Vt due to 
Proposition 1 and Lemma 2. 

According to Proposition 2 and Lemma 1, one can obtain a parsimonious covering of 
Vt by finding a parsimonious covering of V't and V[ is defined in terms of a parsimonious 
covering Vt-i of Vt-i- This is why we said that a parsimonious covering of Vt can be 
obtained by starting from a parsimonious covering of Vt-i- 

Monahan's exhaustive method finds a parsimonious covering of V[ by enumerating all 
the state space functions in V't one by one and detecting those that can be pruned by solving 
linear programs. Lark's algorithm works in a similar fashion except that its linear programs 
have fewer constraints. Since V't consists of |^||Vt_i|l'-'l state space functions, enumerating 
them one by one is very expensive. Other algorithms (Sondik, 1971; Cheng, 1988; Littman, 
1994) avoid this difficulty by exploiting the structures of V^. 
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Procedure incrPruning({Wo, >Vi, . . . , Wat}): 

1. W^Wo. 

2. For i = 1 to iV, 

W ^ purge (W0Wi). 

3. Return W. 

Procedure update (Vt_i): 

• Input: Vt-i — a parsimonious covering of Vt-i. 

• Input: a parsimonious covering of Vt- 

1. For each a&A, 

(a) Compute the sets Q'afi, Q'a,i: • • • ? and Q[j ^. 

(b) Wa ^ incrPruning({Q:,^o, S'a,i, ■ ■ ■ , 2;,,^'}). 

2. Return purge(U„[{r(s,a)}eW«]). 



Figure 3: Incremental pruning and dynamic-programming updates. 
5.4 A New Algorithm for Dynamic-Programming Updates 

Let purge (W) be a subroutine that takes a set W of state space functions and returns a 
parsimonious covering of W. An implementation of purge can be found in the appendix. 

Let Wo 5 ■ ■ ■ 5 be sets of state space functions. Consider the procedure incrPruning 
shown in Figure 3. Let n be a number between and N. Using Lemmas 1 and 2, one can 
easily show by induction that, at the end of the nth pass through the for-loop, the set W is 
a parsimonious covering of 0"=o Consequently, the procedures returns a parsimonious 
covering of ©^qW^. The procedure is named incremental pruning because pruning takes 
place after each cross sum. 

Let 0, 1, . . . , A'^ be an enumeration of all the possible observations, i.e. possible in- 
stantiations of o+. For any action a, suppose the sets Q'afi, Q!a,i^ ■ ■ ■ ■> and Q'^ ^ have 
been computed. Applying incrPruning to those sets, we get a parsimonious covering of 
Q'q j. Denote it by Wa- According to Lemma 2, Ua[{r(5, a)}0Wa] covers V[. Due 
to Lemma 1, this fact implies that a parsimonious covering of UQ[{r(s, a)}0Wa] is also a 
parsimonious covering of V[ and hence of Vt. Thus, a parsimonious covering of Vt can be 
found from a parsimonious covering of Vt-i using the procedure update shown in Figure 3. 

We also use the term incremental pruning to refer to the above algorithm for dynamic- 
programming updates. It has been shown elsewhere (Cassandra et al, 1977) that incremen- 
tal pruning has the same asymptotic complexity as the witness algorithm and empirically 
it significantly outperforms the latter. 
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6. Region-Based Model Approximation 

We have so far been concerned with exact algorithms. Experiments with incremental prun- 
ing, presently the most efficient exact algorithm, have revealed that it can solve only small 
POMDPs (Cassandra et at, 1997). One needs to resort to approximation in order to solve 
large real-world problems. 

Most previous approximation methods solve a POMDP directly; they approximate the 
i-step optimal value function of the POMDP. In the rest of this paper, we develop a new 
method that approximates a POMDP itself by another that has a more informative obser- 
vation model and is hence easier to solve. The latter POMDP is solved and its solution is 
used to construct a solution to the original POMDP. 

6.1 The Basic Idea 

We make the following assumption about problem characteristics, in a POMDP M., even 
though an agent does not know the true state of the world, it often has a good idea about 
the state. Justifications for this assumption were given in the introduction and empirical 
evidence is presented by Simmons Sz Koenig (1995). 

Consider another POMDP M-' that is the same as M. except that in addition to the 
observation made by itself, the agent also receives a report from an oracle who knows the 
true state of the world. The oracle does not report the true state itself. Instead it selects, 
from a predetermined list of candidate regions, a region that contains the true state and 
reports that region. 

More information is available to the agent in A4' than in A4; additional information 
is provided by the oracle. Since in M. the agent already has a good idea about the true 
state of the world, the oracle might not provide much additional information even when the 
candidate regions are small. Consequently, Ai' could be a good approximation of Ai. 

In M', the agent knows for sure that the true state of the world is in the region reported 
by the oracle. For this reason, we say that it is region observable. The region observable 
POMDP M.' can be much easier to solve than M. when the candidate regions are small. For 
example, if the oracle is allowed to report only singleton regions, then it actually reports 
the true state of the world and hence M' is an MDP. MDPs are much easier to solve than 
POMDPs. One would expect the region observable POMDP M.' to be solvable when the 
candidate regions are small. 

6.2 Spectrum of Approximations 

If the region reported by the oracle is always the set of all possible states, then no additional 
information is provided, because the report that the true state of the world is one of the 
possible states has no information content. In this case, M.' has the same solution as M. 
and solving M.' is equivalent to solving M. directly. This is one extreme of the spectrum. 

At the other extreme, if all the candidate regions are singletons, the oracle reports the 
true state of the world. Maximum amount of additional information is provided and M' 
is actually an MDP. The MDP might not be a good approximation of M. but it is much 
easier to solve than M. 
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Previous methods for solving a POMDP either solve it directly or to approximate it 
by using a MDP. By allowing the oracle to report regions that are neither singletons nor 
the set of all possible states, this paper opens up the possibility of exploring the spectrum 
between those two extremes. One way to explore the spectrum is to start with singleton 
candidate regions and increase their sizes gradually. Approximation quality and computa- 
tional time both increase as one goes along. One stops when the approximation is accurate 
enough or the region observable POMDP becomes intractable. A method for determining 
approximation quality will be described later. 

We now set out to make these ideas more concrete by starting with the concept of region 
systems. 

6.3 Region Systems 

A region is simply a subset of states of the world. A region system is a collection of regions 
such that no region is a subset of another region in the collection and the union of all regions 
equals the set of all possible states of the world. We use R to denote a region and TZ to 
denote a region system. 

Region systems are used to restrict the regions that the oracle can choose to report. The 
choice of a region system determines the computational complexity of the region observable 
POMDP Ai' and approximation quality. How to choose regions so as to make proper 
tradeoff between computational time and approximation quality is an open research issue. 
Here is a preliminary approach. The idea is to create a region for each state by including 
its "nearby" states. We say a state s' is ideally reachable in one step from another state s 
if after executing a certain action in state s, the probability of the world ending up in state 
s' is the highest. A state is ideally reachable in k steps from another state sq if there are 
state si, . . . , Sk-i such that Sj+i is ideally reachable from Sj in one step for all 0<«<A;— 1. 
Any state is ideally reachable from itself in step. 

For any non-negative integer k, the radius-k region centered at a state s is the set of 
states that are ideally reachable from s in A; or less steps. A radius-k region system is the one 
obtained by creating a radius-A; region for each state and then removing, one after another, 
regions that are subsets of others. 

When k is 0, the radius-/c region system consists of singleton regions. On the other 
hand, if each state is reachable from any other state in k or less steps, there is only one 
region in the radius-A; region system — the set of all possible states. 

6.4 Region Observable POMDPs 

Given a region system TZ and a POMDP M, we construct a region observable POMDP ^A' 
by assuming that at each time point the agent not only obtains an observation by itself but 
also receives a report from an oracle who knows the true state of the world. The oracle does 
not report the true state itself. Instead it chooses from TZ one region that contains the true 
state and reports that region. 

The amount of additional information provided by the oracle depends not only on the 
region system used but also on the way the oracle chooses regions. For example, if the 
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oracle always reports the region centered at the true state, then it implicitly reports the 
true state itself. 

In order to provide as little additional information as possible, the oracle should consider 
what the agent already knows. However, it cannot take the entire history of past actions 
and observations into account because if it did, M! would not be a POMDP. The current 
observation would depend on the entire history. 

For any non- negative state space function /(s) and any region i?, we call the quantity 
supp{f, R)= ^seR-fi^) I Sse5 /(^) the degree of support of / by R. Note that when / is a 
probability distribution, the denominator is 1. If i? supports / to degree 1, we say that R 
fully supports f. 

We suggest the following region-selection rule for the oracle. Let s_ be the previous true 
state of the world, a. be the previous action, and o be the current observation. The oracle 
should choose, among all the regions in TZ that contain the true state of the world, one that 
supports the function P(s, o|s_, a.) of s to the maximum degree. Where there is more than 
one such regions, choose the one that comes first in a predetermined ordering among the 
regions. 

Here are some arguments in support of the rule. If the previous world state s_ were 
known to the agent, then its current belief state b{s), a function of s, would be proportional 
to P{s,o\s_,a.). In this case, the rule minimizes additional information in the sense that 
the region reported supports the current belief state to the maximum degree. If the pre- 
vious world state is known to be around s., the same is roughly true. Also if the current 
observation is informative enough, being a landmark for instance, to ensure that the world 
state is in a certain region, then the region chosen using the rule fully supports the current 
belief state. In such a case, no additional information is provided. Despite those arguments, 
we do not claim that the rule described above is optimal. Finding a rule that minimizes 
additional information is still an open problem. 

The probability P{R\s,o,s ,a_) of a region R being chosen under the above scheme is 
given by 



P{R\s, o, s_, a.) 



1 if i? is the first region s.t. s^R and for any other region R' 

T.s'&R Pis', o\s_, a.)> T,s'&R' Pis', o\s., a_) 
otherwise. 



The region observable POMDP M' differs from the original POMDP M only in terms 
of observation; in addition to the observation o made by itself, the agent also receives a 
report R from the oracle. We shall denote an observation in M.' by z and write z=(o, i?). 
The observation model of M' is given by 

P(z|s, a_, s_) = P{o, R\s, a_, s_) = -P(o|s, a_)P{R\s, o, s_, a_). 

The joint conditional probability P{s+, z+\s,a) of the next state s+ of the world and the 
next observation z+ given the current state s and the current action a is 

P(s_|., z+\s, a) = P{s+\s, a)P{z+\s+, a, s). 
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7. Solving Region Observable POMDPs 

In principle, the region observable POMDP M' can be solved in the same way as general 
POMDPs using the procedure solvePOMDP. It is not advisable to do so, however, since 
solvePOMDP does not automatically exploit region observability. This section and the next 
two sections develop an algorithm for solving Ai' that takes advantage of region observabil- 
ity. 

7.1 Restricted Value Iteration 

For any region R, let Br be the set of belief states that are fully supported by R. Let TZ be 
the region system underlying the region observable POMDP A^'. Define Bfi = Ur^tzBr. 

It is easy to see that, in M.', no matter what the current belief state b is, the next belief 
state b+ must be in Bjz. We assume that the initial belief state is in Bjz. Then all possible 
belief states the agent might encounter are in B-jz- This implies that policies for M' need 
only be defined over B-r. and value iteration for M' can be restricted to the subset Btz of B. 

We restrict value iteration for Ai' to Bjz for the sake of efficiency. Doing so implies 
that the i-step optimal value function of M.', denoted by ?7*, is defined only over Btz and 
the Bellman residual is now maxi,:^Q^\Ul{b) — To avoid confusion, we call it the 

restricted Bellman residual and call the restricted t-step optimal value function. 

Since Bjz is continuous, restricted value iteration cannot be carried out implicitly. The 
next subsection shows how it can be carried implicitly. 

7.2 Implicit Restricted Value Iteration 

Let W and X be two sets of state space functions and let i? be a region. We say that W 
covers X in region R if, for any bEB^, 

W{b) = X{b). 

We say that W parsimoniously covers X in region i? if W covers X in region R and none 
of its proper subsets do. When W covers or parsimoniously covers X va. a. region, we refer 
to W as a regional covering or a parsimonious regional covering of X. 

Let Ut be the set of all t-step state value functions of M.' ■ According to Theorem 2, 

unb)=utib) 

for any belief state b&Bjz- 

For each region i?, suppose Ut^R is a set of state space functions that parsimoniously 
covers Ut in region R. Then the collection {ilt,R\^^^} ^ representation of in the sense 
that for any b^Bjz, 

u:{b)=ut,R,{b), (12) 

where Ri, is such that b^BR, i.e. such that Ri, fully supports b. 

As will be shown in the next section, parsimonious regional coverings of Ut can be ob- 
tained from parsimonious regional coverings of Zit_i. Let R0P0MDPupdate(i?,ZYj_i^jR^|i?+G7^}) 
be a procedure that takes a region R and parsimonious regional covering {h{t-i^R_f_\R+€.Tl} 
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Procedure solveROPOMDP(A^', e) 

• Input: jM' — A region observable POMDP, 

e — A positive number. 

• Output: A list of sets of state space functions. 

1. t^O. 

2. For Ren, Uo,R ^ {0}. 

3. Do 

• t ^ t+1. 

• For each ReTZ, 

Ut,R ^ ROPOMDPupdate(i?, {Ut-i,R+\R+en}). 

while ROPOm)Pstop{{Ut,R\Ren},{Ut-i,RjR+en},e) = no. 

4. Return {Ut,R\Ren}. 



Figure 4: Implicit restricted value iteration for region-observable POMDPs. 

oiUt-i and returns a set of state space functions that parsimoniously covers Ut in region R 
^. Let ROPOMDPstop be a procedure that determines, from parsimonious regional coverings 
oiUt-i and Ut, whether the restricted Bellman residual has fallen below a predetermined 
threshold. 

The procedure solveRQPQMDP shown in Figure 4 carries out restricted value iteration 
implicitly: instead inductively computing the restricted t-step optimal value function [/* 
itself, it computes parsimonious regional coverings of Ut- In other words, it computes sets 
of state space functions that represent Ut in the sense of (12). 

Let tt' be the greedy policy for M' based on Ut ■ For any h^Bfi, ^'(^) defined by 
Equation (7) with o+ replaced by z^={o+,R+) and replaced by U^ . Since the list 
{Ut^R\Re'Ti) of sets of state space functions returned by solveRQPQMDP represents Ut in the 
sense of (12), we have that for any bEBji 

n'ib) = arg maxa[r{b,a) +-f ^ P{{o+, R+)\b,a)l2t,R+{b+)]. (13) 

o+,R+ 

The next two sections show how to implement the procedures ROPOMDPupdate and 
RQPQMDPstop. 



2. The string "ROPOMDP" in ROPOMDPupdate stands for region-observable POMDP. 
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8. Dynamic-Programming Updates for Region Observable POMDPs 

This section shows how the incremental pruning algorithm developed in Section 5 can 
be adapted to compute parsimonious regional coverings of Ut from parsimonious regional 
coverings of Ut-i- 

8.1 Properties of Regional Coverings 

Lemma 3 Let R he a region and let W, X, and y he three sets of state space functions. If 
W parsimoniously covers X in region R and X covers y in region R, then W parsimoniously 
covers y in region R. □ 

Lemma 4 Let R he a region and let W, W , X, and X' he four sets of state space functions. 
If W and X' respectively cover W and X in region R, then 

1. W'®X' covers W®X in region R. 

2. W'UX' covers WUX in region R. □ 

8.2 Regional Coverings of Ut from Parsimonious Regional Coverings of Ut-i 

Prom parsimonious regional coverings Ut-i,R_^ {R+&TZ) oiUt-i, this subsection constructs, 
for each region R^TZ, a set Ut^R of state space functions and shows that it covers Ut in region 
R. 

For any action a and any observation z+={o+, R+) of M', let Qa,z+,R be the set of all 
state space functions f3 that are of the following form: 

f^(,^ _ I lT.s+(^is+)Pis+,z+\s,a) if sGi? 

> ~ \ otherwise. ^ > 

where a G Ut-i^R^. Define 

Ut,R = U„[{r(5,a)}0(0 2.,.+ ,,^)]. 

Proposition 3 The set Ut^R covers Ut in region R. 

Formal proof of this proposition can be found in Appendix A. Informally, the fact that 

Ut-i,R+ covers Ut-i in region R+ implies that Qa,z+,R covers Qa,z+ in region R, where Qa,z+ 
is given by (11) with o+ and Vj-i replaced by z+ and Ui-i. This fact in turn implies that 
Ut^R covers Ut in region R because of Proposition 1 and Lemma 4. 

8.3 Possible Observations at the Next Time Point 

In the definition oiUt^R, the cross sum is taken over all possible observations. This subsection 
shows that some of the possible observations can be skipped. 
For any action a and any region i?, define 

Za,R = {^+1 ^ z+\^i a) > for some sGi? }. 
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Procedure R0P0MDPupdate(i?,ZYt_i,B_^|i?+G7^}): 

• Inputs: R — A region, and for any region R^, 

l^t-i,R+ parsimoniously covers Ut-i in region R^. 

• Output: A set of state space functions that parsimoniously covers Ut in 
region R. 

1. For each action a, 

(a) Compute the set Za^R and enumerate its members as 0, 1, ... , M. 

(b) For i=0 to M, compute the set Qa,i,R- 

(c) Wa^restrictedIncrPruning({Qa,o,fi, Qa,i,R, ■ ■ ■ , Qa,M,R}, R)- 

2. Return purge (Ua[{r(s, a)} Wa], i?). 

Subroutine restrictedIncrPruning({Wo, Wi, . . . , Wm}i R)'- 

1. Let W^Wo. 

2. For i=l to M, 

W^purge(>V0Wi,i?). 

3. Return W. 



Figure 5: Dynamic-programming updates for region observable POMDPs. 

It is the set of observations that the agent can possibly receive at the next time point given 
that the current state of the world lies in region R and the current action is a. There are 
many observations outside this set. As a matter of fact, an observation z+={o+, R+) is not 
in the set if it is not possible to reach region Rj^ from region R in one step. 

For any 2;+=(o+, if ;2+^^a,ii, then X].s+ 0() = for all s^R. In such a 
case, Qa,z+,R = {0} according to (14). Since, {O}0>V=yV for any set W of state space 
functions, we have 

8.4 Parsimonious Regional Covering of Ut 

Proposition 3 and Lemma 3 imply that, for any region i?, a set of state space functions 
parsimoniously covers Ut in region R if and only if it parsimoniously covering Ut,R in region 
R. According to Lemmas 3 and 4, a set of state space functions that parsimoniously covers 
Ut,R in region R can be found using the procedure ROPOMDPupdate shown in Figure 5 (c.f. 
Section 5.4). In the procedure, the subroutine purge(>V, i?) takes a set W of state space 



218 



Model Approximation for Planning under Uncertainty 



Procedure R0P0MDPstop({^^^,i^|i^G7^}, {^^^_l,i^|i^G7^}, e) 

• Inputs: e — A positive number, and for any region R 

i^t.R covers Ut in region R, and 
Ui-i,R covers Ut-i in region R. 

• Outputs: yes — If the restricted Bellman residual < e, 

no — Otherwise. 

1. For each region i?, 

(a) flag yes. 

(b) For each a^Ut-i,R, 

flag no if dominate{a, Ut^R, R, e) ^ nil. 

(c) For each a&Ut,R, 

flag no if domiii.a.te{a,l/{t-i^R, R, e) ^ nil. 

(d) Return no if flag = no. 

2. Return yes. 



Figure 6: Procedure for determining whether the restricted Bellman residual has fallen 
below a threshold. 



functions and region R, and returns a set of state space functions that parsimoniously covers 
W in region R. An implementation of this subroutine can be found in Appendix B. 

9. The Stopping Condition 

This section shows how to determine whether the restricted Bellman residual has fallen 
below a predetermined threshold e from regional coverings of Ut and Ut-i- For any region 
R, let Ut^R and Ut-i,R be two sets of state space functions that respectively cover Ut and 
Ut-i in region R. By the deflnition of regional coverings, we have 

Lemma 5 The restricted Bellman residual is no larger than e if and only if for any region 
R and any belief state b^BR, 



1. For any a&At^R, 

2. For any a&At-iK, 



a{b) < Ut-i,R{b)+e, and 
a{b) < Ut,R{b)+e-° 
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Let dominate (a, W, i?, e) be a procedure that returns a belief state b in Br such that 
a{b) > yy(5)+e. If such a belief state does not exist, it returns nil. An implementation 
of this procedure can be found in Appendix B. The procedure ROPOMDPupdate shown in 
Figure 6 returns yes if the restricted Bellman residual has fallen below e and no otherwise. 

A couple of notes are in order. First, when the reward function r (s, a) is non-negative, Uf 
increases with t. In this case, the restricted Bellman residual becomes maxi,^^^ {Ul{b)—Uf_i{b)). 
Consequently, step (c) can be skipped. Second, when r{s,a) takes negative values for some 
s and some a, a constant can be added to it so that it becomes non-negative. Adding a con- 
stant to r{s,a) does not affect the optimal policy. However, it makes it easier to determine 
whether the restricted Bellman residual has fallen below a threshold. 

10. Decision Making for the Original POMDP 

Suppose we have solved the region observable POMDP M-'. The next step is to construct 
a policy tt for the original POMDP M. based on the solution for M.'. 

Even though it is our assumption that in the original POMDP M. the agent has a good 
idea about the state of the world at all times, there is no guarantee that its belief state are 
always in Bjz. There is no oracle in A^. A policy should prescribe actions for belief states 
in Btz as well as for belief states outside Bjz- One issue here is that the policy tt' for M' is 
defined only for belief states in Bjz- Fortunately, tt' can be naturally extended to the entire 
belief space by ignoring the constraint b^Bn in Equation (13). We hence define a policy tt 
for M. as follows: for any b&B, 

Trib) = arg maxa[r{b,a) +'y ^ P{{o+, R+)\b,a)Ut,R+{b+)]. (15) 

o+,R+ 

Let k be the radius of the region system underlying M.'. The policy tt given above will 
be referred to as the radius-k approximate policy for M.. The entire process of obtaining 
this policy, including the construction and solving of the region observable POMDP M.\ 
will be referred to as region-based approximation. 

It is worthwhile to compare this equation with Equation (7). In Equation (7), there are 
two terms on the right hand side. The first term is the immediate reward for taking action 
a and the second term is the discounted future reward the agent can expect to receive if it 
behaves optimally. Their sum is the total expected reward for taking action a. The action 
with the highest total reward is chosen. 

The second term is difficult to obtain. In essence, Equation (15) approximates the 
second term using the optimal expected future reward the agent can receive with the help 
of the oracle, which is easier to compute. 

It should be emphasized that the presence of the oracle is assumed only in the process 
of computing the radius-k approximate policy. The oracle is not present when executing 
the policy. 

11. Quality of Approximation and Simulation 

In general, the quality of an approximate policy tt is measured by the distance between its 
value function V^{b) and the optimal value function V*{b). This measurement does not 
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consider what an agent might know about the initial state of the world. As such, it is not 
appropriate for a policy obtained through region-based approximation. One cannot expect 
such a policy to be of good quality if an agent is very uncertain about the initial state of 
the world because it is obtained under the assumption that an agent has a good idea about 
the state of the world at all times. 

This section describes a scheme for determining the quality of an approximate policy in 
cases where an agent knows the initial state of the world with certainty. The scheme can 
be generalized to cases where there is a small amount of uncertainty about the initial state; 
for example, cases where the initial state is known to be in some small region. 

An agent might need to reach a goal from different initial states at different times. Let 
P{s) be the frequency it will start from state s^. The quality of an approximate policy tt 
can be measured by X)s — {s)\P{s), where V*{s) and V^{s) denote the rewards 

the agent can expect to receive starting from state s if it behaves optimally or according to 
TT respectively. 

By definition V*{s)>V^{s) for all s. Let U* be the optimal value function of the 
region observable POMDP A4' . Since more information is available to the agent in A4' , 
U*{s)>V*{s) for alls. Therefore, Y,s[U*{s) -V {s)]P{s) is an upper bound on - 

V''{s)]P{s). 

Let it' be the policy for M' given by (13). When the restricted Bellman residual is small, 
tt' is close to optimal for M' and the value function of n' is close to U*. Consequently, 
J2s[V'{s) - V''{s)]P{s) is an upper bound on Es[V^*(s) - V''{s)]P{s) when the restricted 
Bellman residual is small enough. 

One way to estimate the quantity XI.JF'^ (•§) — {s)]P{s) is to conduct a large number 
of simulation trials. In each trial, an initial state is randomly generated according to P{s). 
The agent is informed of the initial state. Simulation takes place in both M. and M.'. In 
jM, the agent chooses, at each step, an action using it based on its current belief state. The 
action is passed to a simulator which randomly generates the next state of the world and the 
next observation according to the transition and observation probabilities. The observation 
(but not the state) is passed to the agent, who updates its belief state and chooses the next 
action. And so on and so forth. The trial terminates when the agent chooses the action 
declare-goal or a maximum number of steps is reached. Simulation in M' takes place in a 
similar manner except that the observations and the observation probabilities are different 
and actions are chosen using tt'. 

If the goal is correctly declared at the end of a trial, the agent receives a reward in the 
amount 7", where n is the number of steps. Otherwise, the agent receive no reward. The 
quantity EJ^^ (s) -V" {s)]P{s) can be estimated using the difference between the average 
reward received in the trials for M.' and the average reward received in the trials for M.. 

12. Tradeoff Between Quality and Complexity 

Intuitively, the larger the radius of the region system, the less the amount of additional 
information the oracle provides. Hence the closer M-' is to Ai and the narrower the gap 
between Es V'^ ^^'^ Es V'^{s)P{s). Although we have not theoretically proved this, 

3. This is not to be confused with the initial belief state Po, which represents the agent's knowledge about 
the ninitial state at a particular trial. 
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empirical results (see the next section) do suggest that J2s^^ increases with the 

radius of the region system while Y,s {■'^)^{-'^) decreases with it. In the extreme case 
when there is one region in the region system that contains all the possible states of the 
world, M. and M.' are identical and hence so are {s)P{s) and J2s V'^ {s)P{s). 

These discussions lead to the following scheme for making the tradeoff between com- 
plexity and quality: Start with the radius-0 region system and increase the radius gradually 
until the quantity ^^^[F'^ {s) — VTr{s)]P{s) becomes sufficiently small or the region observable 
POMDP M' becomes untractable. 

13. Simulation Experiments 

Simulation experiments have been carried out to show that (1) approximation quality in- 
creases with radius of region system and (2) where there is not much uncertainty, a POMDP 
can be accurately approximated by a region-observable POMDP that can be solved exactly. 
This section reports on the experiments. 

13.1 Synthetic Office Environments 

Our experiments were conducted using two synthetic office environments borrowed from 
Cassandra et al. (1996) with some minor modifications. Layouts of the environments 
are shown in Figure 7, where squares represent locations. Each location is represented as 
four states in the POMDP model, one for each orientation. The dark locations are rooms 
connected to corridors by doorways. 

In each environment, a robot needs to reach the goal location with the correct orien- 
tation. At each step, the robot can execute one of the following actions: move-forward, 
turn-left, turn-right, and declare-goal. The two sets of action models given in Figure 
7 were used in our experiments. For the action move-forward, the term F-F (0.01) means 
that with probability 0.01 the robot actually moves two steps forward. The other terms are 
to be interpreted similarly. If an outcome cannot occur in a certain state of the world, then 
the robot is left in the last state before the impossible outcome. 

In each state, the robot is able to perceive in each of three nominal directions (front, 
left, and right) whether there is a doorway, wall, open, or it is undetermined. The two 
sets of observation models shown in Figure 7 were used in our experiments. 

13.2 Complexity of Solving the POMDPs 

One of the POMDPs has 280 possible states while the other has 200. They both have 64 
possible observations and 4 possible actions. Since the largest POMDPs that researchers 
have been able to solve exactly so far have less than 20 states and 15 observations, it is safe 
to say no existing exact algorithms can solve those two POMDPs. 

We were able to solve the radius-0 and radius- 1 approximations (region observable 
POMDPs) of the two POMDPs on a SUN SPARC20 computer. The threshold for the 
Bellman residual was set at 0.001 and the discount factor at 0.99. The amounts of time it 
took in CPU seconds are collected in the following table. 



222 



Model Approximation for Planning under Uncertainty 



a 



Goal -^D C 3 ^ y 3 ^ 

(south) 



Envlroment B 



(east) 



Environment A 



Transition Probabilities 



Action 


Standard outcomes 


Noisy outcomes 


move-forward 
turn-left 
turn-right 
declare-goal 


N (0.11), F (0.88), F-F (0.01) 
N (0.05), L (0.9), L-L (0.05) 
N (0.05), R (0.9), R-R (0.05) 
N (1.0) 


N (0.2), F (0.7), F-F (0.1) 
N (0.15), L (0.7), L-L (0.15) 
N (0.15), R (0.7), R-R (0.15) 
N (1.0) 



Observation Probabilities 



Actual case 


Standard observations 


Noisy observations 


wall 


wall (0.90), open (0.04), doorway 
(0.04), undetermined (0.02) 


wall (0.70), open (0.19), doorway 
(0.09), undetermined (0.02) 


open 


wall (0.02), open (0.90), doorway 
(0.06), undetermined (0.02) 


wall (0.19), open (0.70), doorway 
(0.09), undetermined (0.02) 


doorway 


wall (0.15), open (0.15), doorway 
(0.69), undetermined (0.01) 


wall (0.15), open (0.15), doorway 
(0.69), undetermined (0.01) 



Figure 7: Synthetic Office Environments. 



Environment 


Standard models 


Noisy models 




Radius-0 Radius- 1 


Radius-0 Radius- 1 


A 


1.26 3373 


1.35 5984 


B 


0.61 2437 


0.72 3952 



We see that the radius- 1 approximations took much longer time to solve than the radius-0 
approximations. Also notice that the region observable POMDPs with noisy action and 
observation models took more time to solve that those with the standard models. This sug- 
gests that the more nondeterministic the actions and the less informative the observations, 
the more difficult it is to solve a POMDP. 
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We were unable to solve the radius-2 approximations. Other approximation techniques 
need to be incorporated in order to solve the approximations based on region systems with 
radius larger than or equal to 2. 

13.3 Approximation Quality for Standard Models 

To determine the quality of the radius-0 and radius-1 approximate policies for the POMDPs 
with standard action and observation models, 1000 simulation trials were conducted using 
the scheme described in Section 11. It was assumed that the agent is equally likely to start 
from any state. Average rewards obtained in the original POMDPs M (i.e. without the 
help of the oracle) and in the corresponding region-observable POMDPs Ai' (i.e. with the 
help of the oracle) are shown in the following table. 







Environment A 


Environment B 






radius-0 radius-1 


radius-0 radius-1 


Average reward in 
Average reward in 
Difference 


M 
M' 


0.806535 0.815695 
0.827788 0.818534 
0.021253 0.002839 


0.866118 0.868533 
0.883271 0.876356 
0.017153 0.007823 



We see that, when the radius-0 policies were used, the differences between the rewards 
obtained in M. and those obtained in M.' are very small in both environments. This indicates 
that the radius-0 region observable POMDPs (i.e. MDPs) are accurate approximations of 
the original POMDPs. The radius-0 approximate policies are close to optimal for the 
original POMDPs. When the radius-1 policies were used, the differences are even smaller; 
the rewards obtained in M. and those obtained in M.' are essentially the same. 

Consider the rewards obtained in the original POMDPs. We see that they are larger 
when radius-1 policies were used than when radius-0 policies were used. This supports our 
claim that approximation quality increases with radius of region system. 

There is a another fact worth mentioning. The differences between rewards obtained in 
M and those obtained in M.' are larger in Environment B than in Environment A. This is 
because Environment B is more symmetric and consequently observations are less effective 
in disambiguating uncertainty in the agent's belief about the state of the world. 

13.4 Approximation Quality for Noisy Models 

One thousand trials were also conducted for the POMDPs with noisy action and observa- 
tion models. Results are shown in the following table. 







Environment A 


Environment B 






radius-0 radius-1 


radius-0 radius-1 


Average reward in 
Average reward in 
Difference 


M 
M' 


0.596670 0.634934 
0.812898 0.722441 
0.214228 0.087507 


0.445653 0.565099 
0.871903 0.818365 
0.426250 0.253266 
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We see that the differences between rewards obtained in M. and rewards obtained in 
A4' are significantly smaller when the radius- 1 policies were used than when the radius-0 
policies were used. This is the case especially in Environment A. Also the rewards obtained 
in M. are larger when the radius- 1 policies were used than when the radius-0 policies were 
used. Those again support our claim that approximation quality increases with radius of 
region system. 

As far as absolute approximation quality is concerned, the radius-0 POMDPs (i.e. 
MDPs) are obviously very poor approximations of the original POMDPs; when the radius-0 
policies were used, the rewards obtained in M are significantly smaller than the rewards ob- 
tained in M.'. For Environment A, the radius-1 approximation is fairly accurate. However, 
the radius-1 approximation remains poor for Environment B. The radius of region system 
needs to be increased. 

Tracing through the trials step by step, we observed some interesting facts. In Envi- 
ronment B, the agent, under the guidance of the radius-1 approximate policy, was able to 
quickly get to the neighborhood of the goal even when starting from far way. The fact 
that the Environment around the goal is highly symmetric was the cause of the poor per- 
formance. Often the agent was not able to determine whether it was at the goal location 
(room), or in the opposite room, or in the left most room, or in the room to the right of 
the goal location. The performance would be close to optimal if the goal location had some 
distinct features. 

In Environment A, the agent, again under the guidance of the radius-1 approximate 
policy, was able to reach and declare the goal successfully once it got to the neighborhood. 
However, it often took many unnecessarily steps before reaching the neighborhood due to 
uncertainty in the effects of the turning actions. For example, when the agent reached 
the lower left corner from above, it was facing downward. The agent executed the action 
turn_lef t. Fifteen percent of the time, it ended up facing upward instead of to the right. 
The agent then decided to move-forward, thinking that it was approaching the goal. But 
it was actually moving upward and did not realize this until a few steps later. The agent 
would perform much better if there were informative landmarks around the corners. 

14. Conclusions 

We propose to approximate a POMDP by using a region observable POMDP. The region 
observable POMDP has more informative observations and hence is easier to solve. A 
method for determining approximation quality is described, which allows one to make the 
tradeoff between approximation quality and computational time by starting with a coarse 
approximation and refining it gradually. Simulation experiments have shown that when 
there is not much uncertainty in the effects of actions and observations are informative, 
a POMDP can be accurately approximated by a region observable POMDP that can be 
solved exactly. However, this becomes infeasible as the degree of uncertainty increases. 
Other approximate methods need to be incorporated in order to solve region observable 
POMDPs whose radiuses are not small. 
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Appendix A: Proofs of Propositions 2 and 3 

Lemma 6 Suppose W and X are two sets of state space functions. If W covers X , then 
for any non-negative function f(s), 

maxa^w a{s)f{s) = maxp^x ^ (3{s)f{s).a 

s s 

Proof of Proposition 2: Because of Proposition 1 and Lemma 2, it suffices to show that 
2a,o+ covers Qa,o+- By the definition of Qa,o+ and Equation (10), we have, for any belief 
state b, that 

Qa,o+{b) = maXa^y^_^Ylh^(^i^+)Pi^+^0+\^^(^)Ms) 

s s+ 

= 7 maXa^y^_^Ya{s+)[Yb{s)P{s+,o+\s,a)] 

s+ s 

Since Vt-i covers Vt-i and the term within the square brackets is a non-negative function 
of s_|., by Lemma 6 we have 

Qa,o+{b) = 7 maa;^g-p^_^ ^a(s+)[^6(s)P(s+,o+|s,a)] 

s+ s 

= max^^^^^ J2 "(s+)-P(s+, o+|s, a)]b{s) 
s s+ 

where the last equation is due to the definition of Q'a^o+ ^^'^ Equation (10). So, Qa^o+ does 
cover Qa,o+- The proposition is proved. □ 

Lemma 7 For any observation z^={o^, R^) of the region observable POMDP A4', 

P{s+,z+\s,a) = 0, 

for any s+^R+. □ 

Informally, this lemma says that the true state of the world must be in the region reported 
by the oracle. 

Lemma 8 Let W and X be two sets of state space functions and R be a region. IfW covers 
X in region R, then for non-negative function f{s) that is when s^R, we have 

maxa^w a{s)f{s) = maxp^x Y f^i^)fi^)-^ 

s s 
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Proof of Proposition 3: Because of Proposition 1 and Lemma 4, it suffices to show that 
Qa,z+,R covers Qa,z+ in region i?, where Qa,z+ is given by (11) with o+ and Vt-i replaced 
by z+ and Ut-i- 

Let b be any belief state in Br. Similar to the proof of Theorem 2, we have 

Qa,z+ib) = 7 maa;a£Zi(_i 

s+ s 
' s+ s 
s&R s+ 
s 

= Qa,z+,n{h), 

where the second equation is true because of the fact that Ut-i,R_^ covers Ut-i in region 
and of Lemma 8. The term within the square brackets is a non-negative function of and 
it is when s+^R+ because of Lemma 7. The fourth equation is true because that 6(s)=0 
when s^R. The proposition is proved. □ 



Appendix B: Domination and Pruning 

This appendix describes implementation of the procedures dominate (a, W, R, e), purge (W, i?), 
and purge (W). They were not given in the main text because they are minor adaptations 
of existing algorithms. 

The procedure dominate (a, W, i?, e) takes, as inputs, a state space function «, a set of 
state space functions W, a region R, and a nonnegative number e. It returns a belief state 
h in Br such that a(h)>W{h)+e. If such a belief state does not exist, it returns nil. It can 
be implemented as follows. 

Procedure dominate (a, W, i?, e) 

• Inputs: a — A state space function, 

W — A set of state space functions, 

R — A region, e — A nonnegative number. 

• Output: A belief state in Br or nil. 

1. If W=0, return an arbitrary belief state in Br. 

2. Solve the following linear program: 

Variables: a;, h{s) for each s^R. 

Maximize: x 

Constraints: 

X] a{s)h{s) >x+Y^ (3{s)b{s) for all 
seR seR 

E ^(^) = 1 

sGR 

b{s) > for all seR. 
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3. If x<e, return nil, else return b. 

The procedure purge(W,it!) takes a set of state space functions W and a region R 
and returns a set of state space functions that parsimoniously covers W in region R. To 
implement it, we need two subroutines. 

A state space function a pointwise dominates another state space function /3 in a region 
R if a{s)>(3{s) for all sER. The subroutine pointwisePurge(W, i?) returns a minimal 
subset W of W such that each state space function in W is pointwise dominated in the 
region R by at least one state space function in W. Implementation of this subroutine is 
straightforward. 

The subroutine best (6, W, R) returns a state space function « in W such that J2seR H-'^)ci{s) > 
Xlsg^ for any other state space function /3 in W. Implementation of the subroutine 

is straightforward except for the issue of tie breaking. If the ties are not broken properly, 
purge (W, R) might return a regional covering of W that is not parsimonious. A correct way 
to break ties is as follows: Fix an ordering among states in R. This induces a lexicographic 
ordering among all state space functions. Among the tied state space functions, chose the 
one that is the largest under the lexicographic ordering (Littman, 1994). 

The following implementation of purge is based on Lark's algorithm (White, 1991). 

Procedure purge (>V,i?) 

• Inputs: W — A set of state space functions, 

R — A region. 

• Output: A set of state space functions that parsimoniously covers W in 
region R. 

1. W-^pointwisePurge(W, i?). 

2. X^tD. 

3. While 

(a) Pick a state space function a from W. 

(b) dominate(Q!, ^l", i?, 0). 

(c) If b =nil, remove a from W. 

(d) Else remove best(6, W, i?) from W and add it to X. 

4. Return X. 

Finally, the procedure purge (W) takes a set of state space functions W and returns a 
parsimonious covering of W. It can be implemented simply as follows. 

Procedure purge (W): 

• purge (W, 5). 

Here, S is the set of all possible states of the world. 
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